Abstract

A fundamental goal in cellular signaling is to understand allosteric communication, the process by which
signals originating at one site in a protein propagate reliably to affect distant functional sites. The general principles
of protein structure that underlie this process remain unknown. Statistical coupling analysis (SCA) is a statistical
technique that uses evolutionary data of a protein family to measure correlation between distant functional sites
and suggests allosteric communication. In proteins, weak long-range interactions between groups of
amino acids provide this communication, which can be important for signaling. In this paper, we present
the SCA of the protein alignment of the esterase family (pfam ID: PF00756), which contains the sequence of antigen 85C
secreted by Mycobacterium tuberculosis, to identify a subset of interacting residues. Clustering analysis of the
pairwise correlations highlighted seven important residue positions in the esterase family alignments. These
residues were then mapped onto the crystal structure of antigen 85C (PDB ID: 1DQZ). The mapping revealed
correlation between three distant residues (Asp38, Leu123 and Met125) and suggests allosteric communication between
them. This information could guide the design of new drugs against tuberculosis.

Keywords: antigen 85C, Mycobacterium tuberculosis, clustering analysis, covariance, statistical coupling
analysis, esterase family, multiple sequence alignments, pfam, Protein Data Bank.



INTRODUCTION 

Communication between distant sites in proteins is
fundamental to their function and often defines the
biological role of a protein family. In signaling proteins,
it represents information transfer: the transmission of
signals initiated at one functional surface to a distinct
surface mediating downstream signaling. For example,
ligand binding at an externally accessible site in
G protein-coupled receptors (GPCRs) reliably triggers
structural changes at distant cytoplasmic domains that
mediate interaction with heterotrimeric G proteins [1,2].
Studies in many other protein systems indicate that
long-range interactions of amino acids are also
important in binding (and catalytic) specificity. Substrate
recognition in the chymotrypsin family of serine
proteases [3,4], the tuning of antibody specificity through
B-cell maturation [5] and the cooperativity of oxygen
binding in hemoglobin [6] all depend not only on
residues directly contacting the substrate, but also on
distant residues located in supporting loops and other
secondary structural elements.

Statistical coupling analysis (SCA) is a technique
used to identify communication between distant sites
in proteins. More specifically, it quantifies how much
the amino acid distribution at some position i changes
upon a perturbation of the amino acid distribution at
another position j. The resulting statistical coupling
energy indicates the degree of evolutionary dependence
between the residues, with higher coupling energy
corresponding to increased dependence. SCA was
previously applied to the PDZ domain family. PDZ is an
acronym combining the first letters of the three proteins
first discovered to share the domain [7]: postsynaptic
density protein (PSD95), Drosophila disc large tumor
suppressor (DlgA) and zonula occludens-1 protein (zo-1).
That analysis predicted a set of energetically coupled
positions for a binding site residue that included
unexpected long-range interactions. Further mutational
analysis confirmed the prediction that the statistical
energy function is a good indicator of thermodynamic
coupling in proteins [8]. Application of SCA to three
structurally and functionally distinct protein families
(GPCRs, the chymotrypsin class of serine proteases and
hemoglobin) revealed a simple architecture of amino acid
interactions in protein families that links distant
functional sites in the tertiary structure [9]. Further
application to the S1A serine protease family indicated
the presence of quasi-independent groups of correlated
amino acids termed "protein sectors". Each of these
sectors was found to
be physically connected in the tertiary structure, had a 
distinct functional role and constituted an independent 
mode of sequence divergence in the protein family [10] . 

The antigen 85 (ag85) complex, composed of three
proteins (ag85 A, B and C), is a major component of
the Mycobacterium (M.) tuberculosis cell wall. Each
protein possesses a mycolyltransferase activity
required for the biogenesis of trehalose dimycolate (cord
factor), a dominant structure necessary for maintaining
cell wall integrity [11,12]. The protein sequence of
ag85 C, which is a member of the esterase family of
proteins, seems to be responsible for the high affinity of
Mycobacterium to fibronectin [12]. Since ag85 proteins
are important for cell wall biosynthesis, they could be
targets for novel antitubercular drugs. With the aim
of identifying residue positions that could be
responsible for the function of ag85 C and hence act as
potential drug targets for M. tuberculosis, we applied
SCA to the esterase protein family.

MATERIALS AND METHODS 

Methods 

The SCA was performed as described by Halabi
et al. Briefly (details in Results), each site i in the
multiple sequence alignment (MSA) was assigned a
sequence conservation parameter $D_i^{(a)}$. As an example,
the $D_i^{(a)}$ scores for one of the test proteins were plotted
in Fig. 1. The correlated mutation score $C_{ij}^{(ab)}$ expresses
the difference between $D_i^{(a)}$ in the full alignment and in the
perturbed alignment where the amino acid at position j was
constrained. Instead of choosing particular residue positions
to constrain the MSA, a series of perturbations was carried
out by sequentially eliminating one sequence at a
time from the alignment. The fluctuation of sequence
conservation at each site was recorded to give perturbation
trajectories for each amino acid at each site. Sites
that are not evolutionarily coupled are expected to
show independent patterns of fluctuation, while sites
that are coupled are expected to show some mutual
dependence in their fluctuations. The final output is a
matrix, with dimensions given by the length of the sequence
alignment, containing a scalar coupling value for each pair
of positions.

Datasets 

The putative esterase family (pfam ID: PF00756)
alignment (4945 sequences; updated till April 2010) was
downloaded from the pfam database
(http://pfam.sanger.ac.uk/) [10]. Redundant sequences were
removed at a threshold of 99% using JalView [14], leaving
846 of the 4945 sequences. The dataset was realigned using
ClustalX, and no further manual adjustments were made [15].
The dataset was further refined by removing all columns with
more than 20% gaps (using MATLAB), leaving 149 columns.
This prevented any trivial over-representation of
gaps in the alignment and ensured that the calculations
were made only at largely non-gapped sequence
positions. SCA was performed using the code provided
by Ranganathan [10]. Hierarchical clustering analysis was
performed by constructing dendrograms of the correlation
matrix using MATLAB. Coupled residues were mapped onto
the crystal structure of the secreted form of ag85 C (PDB
ID: 1DQZ), which was available from the PDB database for
structural analysis.
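
The gap-filtering step described above can be sketched in a few lines. This is a minimal illustration in Python, not the MATLAB code used in the study; the function name `drop_gapped_columns`, the accepted gap characters and the toy alignment are all hypothetical.

```python
def drop_gapped_columns(alignment, max_gap_fraction=0.20, gap_chars="-."):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    `alignment` is a list of equal-length aligned sequences (strings).
    Returns the filtered sequences and the indices of the columns kept.
    """
    if not alignment:
        return [], []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n_seqs = len(alignment)
    kept = []
    for col in range(length):
        gaps = sum(1 for seq in alignment if seq[col] in gap_chars)
        # Keep the column unless MORE than the allowed fraction is gaps.
        if gaps / n_seqs <= max_gap_fraction:
            kept.append(col)

    filtered = ["".join(seq[c] for c in kept) for seq in alignment]
    return filtered, kept


# Toy alignment (made up for illustration): column 2 is 60% gaps.
toy = ["MK-LV", "MKALV", "M--LV", "MK-LI", "MKALV"]
filtered, kept = drop_gapped_columns(toy)
```

In this toy case only the third column is removed, because a column with exactly 20% gaps does not exceed the threshold.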

RESULTS 

The conservation of an amino acid a at position i in
a multiple sequence alignment is defined by $D_i^{(a)}$, the
divergence (or relative entropy) of the observed
frequency of a at i ($f_i^{(a)}$) from the background frequency
of a in all proteins ($q^{(a)}$):

$$D_i^{(a)} = f_i^{(a)} \ln\frac{f_i^{(a)}}{q^{(a)}} + \left(1 - f_i^{(a)}\right) \ln\frac{1 - f_i^{(a)}}{1 - q^{(a)}}$$
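
As a numerical illustration of the conservation measure, the following Python sketch evaluates the relative entropy for a single amino acid at a single position. The function name `conservation` and the background frequency of 0.05 used in the example are assumptions made for illustration and are not taken from the study.

```python
import math


def conservation(f, q):
    """Relative entropy D of an observed frequency f against background q.

    f: observed frequency of amino acid a at position i (0 <= f <= 1).
    q: background frequency of amino acid a in all proteins (0 < q < 1).
    """
    if not 0.0 < q < 1.0:
        raise ValueError("background frequency q must lie strictly between 0 and 1")
    if not 0.0 <= f <= 1.0:
        raise ValueError("observed frequency f must lie between 0 and 1")

    def term(x, y):
        # By convention 0 * ln(0) is taken as 0.
        return 0.0 if x == 0.0 else x * math.log(x / y)

    return term(f, q) + term(1.0 - f, 1.0 - q)


# Hypothetical background frequency, for illustration only.
q_example = 0.05
d_unconserved = conservation(0.05, q_example)  # observed equals background
d_invariant = conservation(1.0, q_example)     # position fully conserved
```

When the observed frequency equals the background frequency the measure is zero, and for a fully conserved position it reduces to ln(1/q).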

$D_i^{(a)}$ is a non-linear function of $f_i^{(a)}$ that rises more and
more steeply as $f_i^{(a)}$ approaches one. As a practical
consequence, for all but the least conserved positions,
the overall conservation of all amino acids at each
position i is well approximated by $D_i^{(a_i)}$, the conservation
of $a_i$, the most prevalent amino acid at that position [13].

Fig. 1 shows the positional conservation of all
149 residue positions that remained after removal of
redundancy and gap reduction on the original downloaded
MSA; we call this new alignment the "truncated
alignment" of the esterase family. It also shows the
distribution of $D_i^{(a)}$ values across all positions as a
measure of conservation.

The histogram of pairwise correlation values of the 149
residue positions (Fig. 2) indicates that most
of the values (nearly 99%) lie close to the mean value
of the histogram, and only about 1% of all pairs show
significant correlation (values greater than
$e^{x_0 + 2.36\sigma}$). This suggests that in the putative
esterase protein family only a few residue pairs interact
and communicate with each other, even over a long range.
According to the CASP guidelines
(http://www.predictioncenter.llnl.gov/CASP9), a "long-range"
inter-residue contact is defined as two residues that are
separated by at least nine residue positions in the linear
sequence and whose Cβ-Cβ distance is ≤ 8 Å [16].
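
The long-range contact criterion quoted above is easy to state as a predicate. The sketch below is illustrative only: the function name `is_long_range_contact` is hypothetical, the thresholds are the ones given in the text, and the coordinates in the example are invented rather than taken from any PDB entry.

```python
import math


def is_long_range_contact(res_i, res_j, cb_i, cb_j,
                          min_separation=9, max_distance=8.0):
    """Check the long-range contact criterion as stated in the text.

    res_i, res_j: residue numbers in the linear sequence.
    cb_i, cb_j:   (x, y, z) coordinates of the two C-beta atoms, in angstroms.
    """
    separation = abs(res_i - res_j)
    distance = math.dist(cb_i, cb_j)  # Euclidean distance (Python 3.8+)
    return separation >= min_separation and distance <= max_distance


# Invented coordinates, for illustration only.
far_in_sequence_close_in_space = is_long_range_contact(
    38, 125, (0.0, 0.0, 0.0), (3.0, 4.0, 0.0))   # 87 apart, 5 angstroms
close_in_sequence = is_long_range_contact(
    38, 40, (0.0, 0.0, 0.0), (3.0, 4.0, 0.0))    # only 2 apart
far_in_space = is_long_range_contact(
    38, 125, (0.0, 0.0, 0.0), (0.0, 0.0, 9.0))   # 9 angstroms
```

Only the first pair satisfies both conditions; the other two fail on sequence separation and spatial distance, respectively.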

Fig. 1 Sequence conservation in the esterase family
multiple sequence alignment. The degree of sequence
conservation is plotted along the protein sequence (x-axis:
positions), with high values indicating high conservation.
The plot corresponds to the diagonal elements of the
correlation matrix.

The result of the correlation analysis is shown as
a correlation matrix, which is represented as a heat
map (Fig. 3). The basic principle of the SCA correlation
matrix is to weight the frequency-based correlations
between positions i and j,
$C_{ij}^{(ab)} = f_{ij}^{(ab)} - f_i^{(a)} f_j^{(b)}$, where
$f_{ij}^{(ab)}$ is the joint frequency of amino acid a at
position i and amino acid b at position j, by
a function of their positional conservations, which are
given by $D_i^{(a)}$ and $D_j^{(b)}$:

$$\tilde{C}_{ij}^{(ab)} = \phi\left(D_i^{(a)}\right)\,\phi\left(D_j^{(b)}\right)\,C_{ij}^{(ab)}$$




Fig. 2 Histogram of pairwise correlation values (x-axis:
correlation value). The figure suggests that only a few pairs
show significant correlation, whereas most of the correlation
values (99%) lie around the mean value of 0.040978.

Fig. 3 Heat map of the correlation matrix. The red
pixels represent high correlation values and, moving down the
color scale, the blue pixels represent low correlation values.

Thus $\tilde{C}_{ij}^{(ab)}$ is a measure of the significance of
observed correlations as judged by the conservation of
the amino acids under consideration. Following the
earlier work, we chose the weighting functions here to
be gradients of positional conservation:
$\phi\left(D_i^{(a)}\right) = \partial D_i^{(a)} / \partial f_i^{(a)}$.
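
To make the weighting concrete, the sketch below computes the raw frequency-based correlation for one pair of alignment columns and one pair of amino acids, then weights it by the gradient of the relative entropy, which differentiates to ln[f(1 - q) / (q(1 - f))]. This is an illustrative Python sketch, not the SCA code used in the study; the function names, the toy columns and the background frequency of 0.05 are assumptions.

```python
import math


def phi(f, q):
    """Gradient dD/df of the relative entropy D(f || q).

    Differentiating D = f ln(f/q) + (1-f) ln((1-f)/(1-q)) with respect to f
    gives ln[f (1-q) / (q (1-f))]; it is undefined at f = 0 and f = 1.
    """
    if not (0.0 < f < 1.0 and 0.0 < q < 1.0):
        raise ValueError("f and q must lie strictly between 0 and 1")
    return math.log(f * (1.0 - q) / (q * (1.0 - f)))


def weighted_correlation(col_i, col_j, a, b, q_a, q_b):
    """Conservation-weighted correlation of amino acid a at column i with b at column j."""
    n = len(col_i)
    if n == 0 or n != len(col_j):
        raise ValueError("columns must be non-empty and of equal length")
    f_i = sum(1 for x in col_i if x == a) / n
    f_j = sum(1 for y in col_j if y == b) / n
    f_ij = sum(1 for x, y in zip(col_i, col_j) if x == a and y == b) / n
    raw = f_ij - f_i * f_j  # frequency-based correlation
    return phi(f_i, q_a) * phi(f_j, q_b) * raw


# Toy columns (made up): A at column i always co-occurs with V at column j.
coupled = weighted_correlation("AAAAAALL", "VVVVVVII", "A", "V", 0.05, 0.05)
# Toy columns in which the two positions vary independently.
independent = weighted_correlation("AALL", "VIVI", "A", "V", 0.05, 0.05)
```

For the coupled toy columns the raw correlation is 0.75 - 0.75 × 0.75 = 0.1875, which is then scaled by the two gradient terms; for the independent columns the raw correlation, and hence the weighted value, is zero.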

For a correlation matrix M, the value M(i,j) is
the correlation value between the ith and jth
residues in the multiple sequence alignment (MSA). The
heat map in Fig. 3 is a representation of this correlation
matrix for the esterase family alignments, where most
of the pixels are blue, indicating that most of the
correlation values are low. However, there
are a few red spots, highlighting the significant
correlations. The diagonal values represent auto-correlation
values, which reflect the conservation
of that particular position. In order to identify the pairs
of residues with high correlation values, we performed
hierarchical clustering, a method of cluster
analysis that seeks to build a hierarchy of clusters.
Hierarchical clustering enables grouping of the data
over a variety of scales by creating a cluster tree or
dendrogram. The tree is not a single set of clusters,
but rather a multilevel hierarchy, where clusters at
one level are joined as clusters at the next level [17].
Fig. 4 shows the result of hierarchical clustering on the
pairwise correlation values obtained from the SCA of the
esterase family alignments, in which a cluster of
7 residues (in green) was identified. Positions 22, 47,
67, 92, 94, 121 and 146 from the truncated alignment
formed this cluster of closely related residue positions,
which are statistically correlated with one another with
high correlation values. Our hypothesis is that these
positions are significant for the function of
esterase family proteins.
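
The idea of grouping positions by hierarchical clustering can be illustrated with a small agglomerative sketch. The study used MATLAB and does not state the linkage method or the distance measure, so the choices here (single linkage, distance taken as 1 minus the correlation, a fixed cutoff) are assumptions, and the 4 × 4 correlation matrix is invented for illustration.

```python
def single_linkage_clusters(dist, cutoff):
    """Agglomerative single-linkage clustering, stopped at a distance cutoff.

    dist:   symmetric matrix (list of lists) of pairwise distances.
    cutoff: clusters are merged only while the closest pair is within this distance.
    Returns a sorted list of clusters, each a sorted list of indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def gap(c1, c2):
        # Single linkage: distance between the closest members of two clusters.
        return min(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        d, x, y = min(
            (gap(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > cutoff:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return sorted(clusters)


# Invented correlation matrix: positions 0-2 are mutually correlated, 3 is not.
corr = [
    [1.00, 0.90, 0.80, 0.10],
    [0.90, 1.00, 0.85, 0.05],
    [0.80, 0.85, 1.00, 0.10],
    [0.10, 0.05, 0.10, 1.00],
]
dist = [[1.0 - c for c in row] for row in corr]
clusters = single_linkage_clusters(dist, cutoff=0.5)
```

Stopping the merging at a cutoff is equivalent to cutting the dendrogram at that height; in this toy case the three mutually correlated positions end up in one cluster and the fourth position remains on its own.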




Fig. 4 Dendrogram representing clustering analysis
of pairwise correlation values. The cluster represented in
green is the cluster with the most significantly coupled residue
positions. The x-axis represents the residue position in the
truncated alignment and the y-axis represents the clustering
distance between any two residue positions. The arrow indicates
a cluster of seven residues.

Esterase family alignments contain the sequence of
ag85 C secreted by M. tuberculosis. We manually mapped these
residue positions from the truncated alignment onto
the crystal structure of the secreted form of ag85 C.
Fig. 5 shows the mapping of the seven residues. The
structure of the apo form of ag85 C (PDB accession 1DQZ)
covers residues 8-282.

As can be seen in Fig. 5, the residues that have been
mapped are Asp38, Val66, Phe98, Leu123, Met125,
Trp186 and Ser215. A close look at the structural
positions of these residues highlights some interesting
and important information about these residues
and ag85 C. Three of the seven residues (shown in red
in Fig. 5) seem to be physically interacting in the form
of electrostatic interactions; furthermore, the pairs
Asp38-Met125 and Asp38-Leu123 also satisfy the CASP
definition of long-range contacts [16]. Fig. 6 shows the
structure of another esterase, antigen 85B.




Fig. 5 Mapping of the 7 residue positions highlighted
by SCA onto the crystal structure of antigen 85 C
(PDB ID: 1DQZ). The residues shown are Asp38, Val66,
Phe98, Leu123, Met125, Trp186 and Ser215. The pairs shown in
red (Asp38-Leu123 and Asp38-Met125) seem to show physical
interaction through allostery.




Fig. 6 Structure of the Mycobacterium tuberculosis
30 kDa major secretory protein (antigen 85B), a mycolyl
transferase (PDB ID: 1F0P), another esterase showing
a similar pocket.

DISCUSSION 

In this work, we have presented the SCA of the MSA
of the putative esterase family of proteins, which is known
to contain several seemingly unrelated proteins. Three
(Asp38, Leu123 and Met125) of the seven residues
identified by SCA seem to interact physically, as seen
from the mapping on the crystal structure of ag85 C
(Fig. 5). The pair Arg38 and Met125 has previously been
reported to be important for the function of ag85
C. Arg38 is one of the residues that form a pocket
with a high negative electrostatic potential. The structure
of ag85 C complexed with a covalent inhibitor implicates
residues Leu40 (close to Arg38) and Met125
as components of the oxyanion hole [10]. Mutating such
a pair may result in the non-functioning of ag85 C.
The functioning of ag85 C is crucial for the survival
of M. tuberculosis in the host environment [18]. The results
of this investigation suggest that Arg38 and Met125
can be potential targets for antitubercular drugs, as
allosteric changes in the structure of these two residues
may lead to the non-functioning of ag85 C and hence
inactivation of the organism. The active site of ag85 C
illustrates the binding mode of the substrate and extends
the knowledge concerning specific protein/substrate
interactions. This is the first step toward designing
potent inhibitors of the mycolyltransferase activity of
the ag85 enzymes [13].